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Abstract 

Background: Strains of Mycobacterium tuberculosis complex (MTBC) can be classified into major lineages based on 
their genotype. Further subdivision of major lineages into sublineages requires multiple biomarkers along with 
methods to combine and analyze multiple sources of information in one unsupervised learning model. Typically, 
spacer oligonucleotide type (spoligotype) and mycobacterial interspersed repetitive units (MIRU) are used for TB 
genotyping and surveillance. Here, we examine the sublineage structure of MTBC strains with multiple biomarkers 
simultaneously, by employing a tensor clustering framework (TCF) on multiple-biomarker tensors. 

Results: Simultaneous analysis of the spoligotype and MIRU type of strains using TCF on multiple-biomarker 
tensors leads to coherent sublineages of major lineages with clear and distinctive spoligotype and MIRU signatures. 
Comparison of tensor sublineages with SpolDB4 families either supports tensor sublineages, or suggests 
subdivision or merging of SpolDB4 families. High prediction accuracy of major lineage classification with supervised 
tensor learning on multiple-biomarker tensors validates our unsupervised analysis of sublineages on multiple- 
biomarker tensors. 

Conclusions: TCF on multiple-biomarker tensors achieves simultaneous analysis of multiple biomarkers and 
suggest a new putative sublineage structure for each major lineage. Analysis of multiple-biomarker tensors gives 
insight into the sublineage structure of MTBC at the genomic level. 



Background 

Tuberculosis (TB), a bacterial disease caused by Myco- 
bacterium tuberculosis complex (MTBC), is a leading 
cause of death worldwide. In the United States, isolates 
from all TB patients are routinely genotyped by multiple 
biomarkers. The biomarkers include Spacer Oligonu- 
cleotide Types (spoligotypes), Mycobacterial Inter- 
spersed Repetitive Units - Variable Number Tandem 
Repeats (MIRU-VNTR), IS6110 Restriction Fragment 
Length Polymorphisms (RFLP), Long Sequence Poly- 
morphisms (LSPs), and Single Nucleotide Polymorph- 
isms (SNPs). 
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Genotyping of MTBC is used to identify and distin- 
guish MTBC into distinct lineages and/or sublineages 
that are quite useful for TB tracking, TB control, and 
examining host-pathogen relationships [1]. The six main 
major lineages of MTBC are M. africanum, M. bovis, M. 
tuberculosis subgroup Indo-Oceanic, M. tuberculosis 
subgroup Euro-American, M. tuberculosis subgroup East 
Asian (Beijing) and M. tuberculosis subgroup East- 
African Indian (CAS). Other major lineages exist such 
as M. canettii and M. microti, but they do not com- 
monly occur in the US, so we do not consider them 
here. These major lineages can be definitively character- 
ized using LSPs [2], but typically only spoligotypes and 
MIRU are collected for the purpose of TB surveillance. 
Classification, similarity search, and expert-rule based 
methods have been developed to correctly map isolates 
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genotyped using MIRU and/or spoligotypes to the major 
lineages [3-5]. 

While sublineages of MTBC are routinely used in the 
TB literature, their exact definitions, names, and num- 
bers have not been clearly established. The SpolDB4 
database contains 39,295 strains and their spoligotypes 
with the vast majority of them labeled and classified 
into 62 sublineages [6], but many of these are consid- 
ered to be "potentially phylogeographically-specific 
MTBC genotype families", rather than distinct phylo- 
genetic sublineages with known biomarkers. Therefore, 
further analysis is needed to confirm these sublineages. 
The highly-curated MIRU -VNT 'Rplus website, which 
focuses primarily on MIRU, defines 22 sublineages. 
New definitions of sublineages based on LSPs and 
SNPs are being discovered; e.g. the RD724 polymorph- 
ism corresponds to the previously defined SpolDB4 T2 
sublineage, also known as the Uganda strain in MIRU- 
VNTRplus[7]. Now large databases using spoligotype, 
MIRU patterns, and RFLP exist. The United States 
Centers for Disease Control and Prevention (CDC) has 
gathered spoligotypes and MIRU isolates for over 
37,000 patients. Well-defined TB sublineages based on 
spoligotype and MIRU are critical for both TB control 
and TB research. 

The goal of this paper is to examine the sublineage 
structure of MTBC on the basis of multiple biomarkers. 
The proposed method reveals structure not captured in 
SpolDB4 spoligotype families because SpolDB4 subline- 
age only take into account a single biomarker, spoligo- 
types. A spoligotype-only tool, SPOTCLUST, was used 
to find MTBC sublineages using an unsupervised prob- 
abilistic model, reflecting spoligotype evolution [8], A 
key issue is to combine spoligotype and MIRU into a 
single unsupervised learning model. When MIRU pat- 
terns are considered, SpolDB4 families that are well-sup- 
ported by spoligotype signatures may become 
ambiguous, or allow subdivision/ merging of the families. 
Existing phylogenetic methods can be readily applied to 
MIRU patterns, but specialized methods are needed to 
accurately capture how spoligotypes evolve. It is not 
known how to best combine spoligotype and MIRU pat- 
terns to infer a phylogeny. The online tool www.MIR- 
UVNTRplus.org determines lineages by using similarity 
search to a labeled database. The user must select the 
distance measure which is defined using spoligotypes 
and/or MIRU patterns, possibly yielding different results. 

In this study, we develop a tensor clustering frame- 
work to find the sublineage structure of MTBC strains 
labeled by major lineages based on multiple biomarkers. 
This is an unsupervised learning problem. We generate 
multiple-biomarker tensors of MTBC strains for each 
major lineage and apply multiway models for dimen- 
sionality reduction. The model accurately captures 



spoligotype evolutionary dynamics using contiguous 
deletions of spacers. The tensor transforms spoligotypes 
and MIRU into a new representation, where traditional 
clustering methods apply without users having to decide 
a priori how to combine spoligotype and MIRU pat- 
terns. Strains are clustered based on the transformed 
data without using any information from SpolDB4 
families. Clustering results lead to the subdivision of 
major lineages of MTBC into groups with clear and dis- 
tinguishable spoligotype and MIRU signatures. Compari- 
son of the tensor sublineages with SpolDB4 families 
suggests dividing or merging some SpolDB4 families. As 
a way of validating multiple-biomarker tensors, we use 
them in a supervised learning model to predict major 
lineages using spoligotype deletions and MIRU. We 
compare the prediction accuracy of the multiple-bio- 
marker tensor model created with N-PLS (N-way partial 
least squares) with the 2-way PLS applied to matrix data 
and an existing conformal Bayesian Network approach. 

In the next section, we give a brief background on 
clustering and multiway analysis of post-genomic data, 
spoligotyping, and MIRU typing. 

Clustering post-genomic data 

Data clustering is a class of techniques for unsupervised 
classification of data samples into groups of similar 
behavior, function, or trait [9]. Clustering can be used in 
post-genomic data analysis to group strains with similar 
traits. It is common practice to use different clustering 
methods and use a priori biological knowledge to inter- 
pret the clusters, but computational cluster validation is 
needed to validate results without prior knowledge for 
unsupervised classification. A great survey by Handl et 
al. outlines the steps of computational cluster analysis 
on post-genomic data [10]. An application of computa- 
tional cluster validation on microarray data by Giancarlo 
et al. compares the results of clusterings using various 
cluster validation indices [11]. Eisen et al. clusters gene 
expression data which groups genes of similar functions 
[12]. Improved clustering techniques have been devel- 
oped, but how to combine multiple sources of informa- 
tion in one clustering is an open question. 

Application of multiway models to post-genomic 
data clustering 

Clustering on post-genomic data can be accomplished 
based on multiple sources of ground truth. The ground 
truth can be based on multiple biomarkers, host and 
pathogen, or antigen and antibody. A survey by Kriegel 
et al. outlines the methods for finding clusters in high- 
dimensional data [13]. Analysis of multiway arrays for 
data mining is frequently used today in various fields, 
including bioinformatics, to use multiple sources of 
prior information simultaneously [14]. Alter and Golub 
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use higher-order eigenvalue decomposition on a net- 
works x genes x genes tensor and find significant subnet- 
works associated with independent pathways in a 
genome-scale network of relations among all genes of 
cellular systems [15]. Omberg et al. use higher-order 
singular value decomposition on DNA microarray data, 
obtaining the core tensor of eigenarrays x x-eigengenes 
x y-eigengenes and finding correlation between genomes 
in the subtensors of the core tensor [16]. Multiway ana- 
lysis of EEG data identifies epileptic seizures [17]. Use 
of common partitive and hierarchical clustering algo- 
rithms accompanied with multiway modeling of high- 
dimensional data finds functionally related genes in 
stem cells [18]. Similarly, multiple biomarkers of the 
MTBC genome can be used to cluster MTBC strains. 

Spoligotyping 

Spoligotyping is a DNA fingerprinting method that 
exploits the polymorphisms in the direct repeat (DR) 
region of the MTBC genome. The DR region is a poly- 
morphic locus in the genome of MTBC which consists 
of direct repeats (36 bp), separated by unique spacer 
sequences of 36 to 41 bp [19]. The method uses 43 
spacers, thus a spoligotype is typically represented by a 
43-bit binary sequence. Zeros and ones in the sequence 
correspond to the absence and presence of spacers 
respectively. Mutations in the DR region involve dele- 
tion of one or more contiguous spacers. To capture this 
mechanism of mutation in our model, we find informa- 
tive contiguous spacer deletions and represent spoligo- 
type deletions as a binary vector, where one indicates 
that a specific contiguous deletion occurs (i.e. a specified 
contiguous set of spacers are all absent) and zero means 
at least one spacer is present in that contiguous set of 
spacers. 

Large datasets of MTBC strains genotyped by spoligo- 
type have been amassed such as SpolDB4 [6] and a 
more extended online version SITVIT (http://www.pas- 
teur-guadeloupe.fr:808 1 /SIT VITDemo/index.jsp) . Spoli- 
gotypes can be readily used to identify commonly 
accepted major lineages of MTBC with high accuracy 
[4]. SpolDB4 defined a set of phylogeographic subli- 
neages or families based on expert derived rules that are 
in common use in the TB community. In contrast to 
the major lineages that have been validated by more 
definitive markers such as single nucleotide polymorph- 
isms and long sequence polymorphism, the exact defini- 
tion of MTBC sublineages and the accuracy of the 
SpolDB4 families created only using spoligotypes remain 
open questions. 

MIRU-VNTR typing 

MIRU is a homologous 46-100 bp DNA sequence dis- 
persed within intergenic regions of MTBC, often as 



tandem repeats. MIRU-VNTR typing is based on the 
number of tandem repeats of MIRUs at certain identi- 
fied loci. Among these 41 identified mini-satellite 
regions on the MTBC genome, different subsets of sizes 
12, 15, and 24 are proposed for the standardization of 
MIRU-VNTR typing [3]. In this study, we use 12 MIRU 
loci for genotyping MTBC. Thus, the MIRU pattern is 
represented as a vector of length 12, each entry repre- 
senting the number of repeats in each MIRU locus. 

Results 

We used the tensor clustering framework to cluster 
MTBC strains using multiple biomarkers, and compared 
the clustering to SpolDB4 sublineages. Next, we used 
supervised tensor learning and classified MTBC strains 
into major lineages using spoligoype deletions and 
MIRU patterns. We compared multiway and two-way 
supervised learning methods based on their prediction 
accuracy for major lineage classification. In the following 
section, we introduce multiple-biomarker tensors and 
present unsupervised and supervised learning experi- 
ments on multiple-biomarker tensors. 

Multiple-biomarker tensor analysis of strain data 

Multiple biomarkers of the MTBC genome in a rela- 
tional database can be represented as a high-dimen- 
sional dataset for multiway analysis. The multiple- 
biomarker tensor is constructed this way, with one of 
the modes representing strains and other modes repre- 
senting biomarkers. In our experiments, we use this 
multidimensional array or tensor with three modes 
representing strains, spoligotype deletions, and MIRU 
patterns. This multiple-biomarker tensor captures three 
key properties of MTBC strains: spoligotype deletions, 
number of repeats in MIRU loci, and coexistence of 
spoligotype deletions with MIRU loci. 

The strain dataset is arranged as a three-way array 
with strains in the first mode, spoligotype deletions in 
the second mode, and MIRU patterns in the third 
mode. Each entry X(i, /', k) in the tensor corresponds to 
the number of repeats in MIRU locus k of strain i with 
spoligotype deletion /. If spoligotype deletion / does not 
exist in strain i, then the tensor entry X(z',/,.) is 0. Thus, 
strain datasets are formed as Strains x Spoligotype dele- 
tions x MIRU patterns tensors, as shown in Figure 1. 
Mathematically, each strain is represented as the outer 
product of the binary spoligotype deletion vector and 
the MIRU pattern vector, which results in a biomarker 
kernel matrix. Biomarker kernel matrices of the same 
size for each strain form the multiple-biomarker tensor. 
Generation of the multiple-biomarker tensor from 
biomarkers of each strain is shown in Figure 2. We 
represent spoligotype deletions with a binary vector , 
where S; e {0,1}, i e {!,..,«}, and n is the number of 
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Strains 
i=l,..,l 




MIRU patterns 
k=l,..,K 



Spoligotype 
deletions 
j=l,..,J 

Figure 1 Multiple-biomarker tensor Strains x Spoligotype deletions x MIRU patterns tensor. Each entry X(/,y, k) of the tensor represents the 
number of repeats in MIRU locus k of strain /' with spoligotype deletion / 



informative spoligotype deletions found using the fea- 
ture selection algorithm, detailed in the methods sec- 
tion. We represent 12-loci MIRU with a digit vector , 
where ra,- e {I, ..,9, > 9} and / e {1,..,12}. The entries of 
the multiple-biomarker tensor which combines spoligo- 
type and MIRU information can be formulated as: 



where 




0, if spoligotype deletion; does not occur in strain i, 

1, if spoligotype deletion; occurs in strain i. 



and r ik is the number of repeats in MIRU locus k of 
strain i. Multiple-biomarker tensors can be used for 
both unsupervised and supervised learning. Next, we use 



=^ Strains k£ — \ 



m 



m 




Figure 2 Generation of multiple-biomarker tensor Biomarker kernel matrix for each strain forms multiple-biomarker tensor. Vector 5 
represents spoligotype deletions and represents MIRU patterns. 
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the unsupervised tensor clustering framework on multi- 
ple-biomarker tensors to subdivide major lineages of 
MTBC into sublineages. 
Subdivision of major lineages into sublineages 
We subdivide each major lineage of MTBC into subli- 
neages using multiple-biomarker tensors. For each 
major lineage, we generated the multiple-biomarker ten- 
sor using spoligotypes and MIRU types and applied 
multiway models to identify putative sublineages of each 
major lineage. Two multiway analysis methods were 
used: PARAFAC and Tucker3. Details of the methods 
and how the model parameters or components were 
selected can be found in the methods section. The vali- 
dated multiway models with numbers of components 
for each major lineage are shown in Table 1. To evalu- 
ate the resulting clusters, we compared them to the 
published SpolDB4 families for each major lineage. The 
results are summarized in Table 2. We used the F-mea- 
sure to measure how well the tensor sublineages match 
the SpolDB4 families with 1 indicating an exact match 
and 0 indicating no match. The average best-match sta- 
bility is used to assess certainty of tensor sublineages 
respectively with 1 indicating highly stable clusters. For 
each major lineage, results show that the tensor analysis 
finds highly stable sublineages (the best-match stability 
is >84%) and that the number of sublineages found 
using tensors is close but not always identical to the 
number of SpolDB4 families. 

The F-measure values range from 53% to 88% indicat- 
ing that the sublineages found by the tensors only par- 
tially overlap with those of SpolDB4. Recall that the 
SpolDB4 families were created by expert analysis using 
only spoligotypes and that analysis by alternative bio- 
markers such as SNP and LSP has led to alternative 
definitions of MTBC sublineages. The tensor sublineages 
are based on spoligotype and MIRU patterns, thus in 
some cases the tensor divides SpolDB4 families due to 
difference in MIRU patterns even if the spoligotypes 
match. In other cases, the tensor analysis merges the 
SpolDB4 families because the collective spoligotypes and 



MIRU patterns are very close. In some cases, the tensor 
analysis almost exactly reproduces a SpolDB4 family 
providing strong support for the existence of these 
families with no expert guidance. In addition, the MIRU 
patterns provide additional evidence for the existence of 
these distinct sublineages. Thus, multiway analysis of 
MTBC strains of each major lineage with multiple bio- 
markers leads to new sublineages and reaffirms existing 
ones. Further insight can be obtained by examining the 
putative sublineages for each major lineage, which is 
detailed next. 

Sublineage structure of M. africanum The most 
stable clusters were produced using PARAFAC and it 
constructed four putative sublineages of M. africanum, 
denoted MAI to MA4. Table 3 gives the stability of 
each sublineage and the correspondence between the 
tensor sublineages and the SpolDB4 families. These four 
putative sublineages are quite distinct as shown by the 
stability of 1 for each sublineage and the clear separa- 
tion of the four sublineages in the PCA plot in Figure 3. 
Figure 4 shows heat maps representing the spoligotype 
and MIRU signatures for each tensor sublineage, with 
white indicating 0 probability and black indicating prob- 
ability of 1. 

The tensor sublineages strongly support the existence 
of the SpolDB4 AFRI_1, AFRI_2 and AFRI_3 families 
and show that the AFRI family is composed of these 
three families. With an F-measure of 66%, the tensor 
sublineages differ markedly from the SpolDB4 families 
for the M. africanum lineage. The AFRI family results 
largely explain this difference - AFRI is spread across 
three tensor sublineages. Disregarding AFRI, sublineages 
MA2 and MA3 match families AFRI_2 and AFRI_3 
respectively. Interestingly, AFRI l is further subdivided 
into sublineages MAI and MA4. The spoligotypes in 
MAI and MA4 differ by only one contiguous deletion 
of spacers 22 through 24, but their MIRU signatures 
clearly distinguish them especially in MIRU loci 10, 16 
and 40. The tensor indicates that the AFRI sublineage 
classification defines somewhat generic M. africanum 



Table 1 Number of components used in PARAFAC and Tucker3 models. 











PARAFAC 


Tucker3 










# Components 


Core Consistency / Variance 


# Components 


Variance 


M. africanum 


64 x 22 x ' 


2 


3 


95.08 / 93.33 


[4 3 1] 


91.94 


M. bovis 


1 02 x 34 x 


12 


2 


1 00.00 / 86.02 


[7 5 1] 


91.05 


East Asian (Beijing) 


571 x 5 x ' 


2 


2 


100.00 / 81.58 


[3 4 2] 


93.09 


East-African Indian (CAS) 


508 x 18 x 


12 


3 


90.75 / 80.48 


[6 6 4] 


94.27 


Indo-Oceanic 


1023 x 28 x 


12 


5 


92.99 / 80.35 


[15 13 5] 


95.55 


Euro-American 


4580 x 109 X 


12 


14 


99.06 / 89.83 


[14 13 5] 


39.77 



Number of components used in PARAFAC and Tucker3 models to fit the tensors for the datasets to be clustered. We used the core consistency diagnostic to 
validate PARAFAC models and percentage of explained variance to validate Tucker3 models. 
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Table 2 Number of SpolDB4 families and number of tensor sublineages for each major lineage 



Major Lineage 


# SpolDB4 families 


# Tensor sublineages 


F-measure 


Best-match stability 


M. africanum 


4 


4 


0.66 


1 


M. bovis 


5 


3 


0.71 


1 


East Asian (Beijing) 


2 


6 


0.88 


1 


East-African Indian (CAS) 


4 


4 


0.75 


1 


Indo-Oceanic 


13 


9 


0.67 


0.86 


Euro-American 


33 


35 


0.53 


0.84 



F-measure and average best-match stability are used to assess the agreement of the tensor sublineages to the SpolDB4 lineages and certainty of tensor 
sublineages respectively. 



strains that can be distinctly placed in the groups MAI 
(part of AFRI_1), MA4 (other part of AFRI_1), MA2 
(AFRI_2) and MA3 (AFRI_3). 

The MIRU-VNTR/?/m5 labels, determined on the basis 
of LSPs, indicate that there are two sublineages, West 
African 1 and West African 2, within M. africanum. 
Table 4 indicates the correspondence between the ten- 
sor sublineages and MIRU-VNTRj?/m.s labels. MAI and 
MA4 correspond to West African 2 and MA2 corre- 
sponds to West African 1. There is no data labeled by 
MIRU-VNTR^/ws in MA3, but we speculate that it is 
West African 1 since MA2 and MA3 have more closely 
related MIRU and spoligotype signatures. 

Sublineage structure of M. bovis PARAFAC gener- 
ated the most stable clusters and constructed 3 sublin- 
eages for M. bovis, MB1, MB2, and MB3, while the data- 
set contains 5 SpolDB4 families, BOV, BOVIS1, 
BOVISl_BCG, BOVIS2, and BOVIS3. Table 5 gives the 
correspondence between the tensor sublineages and the 
SpolDB4 families. All clusters have perfect stability and 
are well distinguished in the PCA plot in Figure 5. 
Figure 6 shows heat maps representing the spoligotype 
and MIRU type signatures of tensor sublineages. Much 
like the M. africanum SpolDB4 AFRI family, the BOV 
family defines a generic M. bovis sublineage that spreads 
across all three tensor sublineages. Disregarding BOV, 
MB3 consists of all of BOVIS1 and BOVISl_BCG 



Table 3 Confusion matrix of M. africanum strains 





MA1 


MA2 


MA3 


MA4 


Stability 1111 


AFRI 


2 


5 


1 


0 


AFRM 


21 


0 


0 


16 


AFRI_2 


0 


12 


0 


0 


AFRI_3 


0 


1 


6 


0 



Confusion matrix for 64 distinct M. africanum strains showing the 
correspondence between the SpolDB4 families and tensor sublineages. The 
stability of each tensor sublineage is given in the second row. All four M. 
africanum sublineages have a stability of 1, indicating that clear and distinct 
genetic diversity exists between the M. africanum sublineages. Each number 
in the table represents the number of strains that belong to associated 
SpolDB4 lineage in that row and associated tensor sublineage in that column. 



strains. Since BOVIS 1_BCG is the attenuated bacillus 
Calmette-Guerin (BCG) vaccine strain, it is difficult to 
distinguish it from BOVIS 1 using only MIRU patterns 
and spoligotypes. Therefore, the merger of BOVIS1 and 
BOVIS 1 BCG is expected given the genetic similarity 
between the two groups of strains. Disregarding BOV, 
the MB1 and MB2 sublineages exactly match the 
SpolDB4 families BOVIS2 and BOVIS3 respectively. 

Sublineage structure of East Asian (Beijing) The 
most stable clusters are produced by Tucker3 and it 
constructs six distinct sublineages of East Asian (Beij- 
ing), denoted Bl through B6. The variability in the spo- 
ligotypes of East Asian is limited to spacers 35 through 
43 since all East Asian strains have spacers 1 to 34 
absent. Since the SpolDB4 classification is based only on 
spoligotypes, the limited variability allows only two 
families, BEIJING and BEIJING-LIKE. Table 6 shows the 
correspondence between tensor sublin-eages and the 
SpolDB4 families. The clustering plot of tensor 
sublineages is shown in Figure 7. Heat maps represent- 
ing the spoligotype and MIRU type signatures of tensor 
sublineages are shown in Figure 8. The tensor cleanly 
subdivides BEIJING into three sublineages Bl, B4 and 
B6, all with stability 1. Spoligotype signatures of these 
sublineages differ. Bl strains have spacers 35 through 43 
present, whereas B4 strains lack spacer 37, and B6 
strains lack spacer 40. MIRU signature of sublineage B4 
is clearly distinct in MIRU locus 40, having 3 repeats for 
most strains. The tensor subdivides the BEIJING-LIKE 
into sublineages B2, B3 and B5, each with distinct spoli- 
gotype signature. They all lack spacers 35 through 36. 
In addition, B2 strains lack spacer 37, and B3 strains 
lack spacer 40. Thus, the tensor strongly supports the 
existence of BEIJING and BEIJING-LIKE families, but 
also suggests that they can be further subdivided. 

Sublineage structure of East- African Indian (CAS) 
Tucker3 generated the most stable clusters and it con- 
structed four distinct sublineages for East- African Indian 
(also known as CAS) denoted CI, C2, C3, and C4. The 
strains are also labeled with four SpolDB4 lineages: 
CAS, CAS1_DELHI, CAS1_KILI and CAS2. Table 7 
shows the correspondence of tensor sublineages and 
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kmeans_mtimes_seeded result on Mafricanum lineage with k=4 
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Figure 3 The clustering plot of M. africanum strains Clustering plot of M. africanum strains using Principal Component Analysis on the 
score matrix obtained from the PARAFAC model. Four putative tensor sublineages, MAI to MA4, are clearly distinct along the principa 
component axes. 



SpolDB4 families. Figure 9 shows the clustering plot of 
tensor sublineages and Figure 10 shows spoligotype and 
MIRU type signatures of tensor sublineages. All subli- 
neages are highly stable with stability 1. Much like with 
AFRI and BOV, the generic CAS family is divided across 
all tensor sublineages. C3 only contains CAS strains. 
Disregarding CAS, CI contains most CAS1 DELHI 
strains and all CAS2 strains. C4 contains all CAS1_KILI 
strains. C2 contains 2 CAS1_DELHI strains, but the vast 
majority (331 strains) of CAS1_DELHI strains fall in CI. 
In addition to the common deletions of East-African 
Indian (CAS) strains, C2 strains lack spacer 22, C3 
strains lack spacers 20 through 22, and C4 strains lack 
spacers 20 through 22 and spacer 35. Variabilities in 
MIRU loci 10, 26, 31 and 40 are also key to defining dif- 
ferences in the sublineages. C2 and C3 strains differ by 



variations in MIRU locus 10. C4 strains which include 
all CAS1_KILI strains exhibit a very distinct MIRU sig- 
nature compared to other tensor sublineages, especially 
in MIRU locus 26. 

Sublineage structure of Indo-Oceanic PARAFAC 
found the most stable clusters and it constructs nine 
distinct putative sublineages for Indo-Oceanic, denoted 
IOl to 109, while the dataset has thirteen SpolDB4 
lineages. Table 8 shows the correspondence of tensor 
sublineages and SpolDB4 families. Figure 11 shows the 
clustering plot of tensor sublineages and Figure 12 
shows spoligotype and MIRU signatures of tensor subli- 
neages. The EAI5 family acts much like the CAS, BOV, 
and AFRI families, spreading across all the Indo-Oceanic 
sublineages except 104. The small MANU1 family also 
spreads across four sublineages. The existence of the 
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Figure 4 Biomarker signatures of M. africanum tensor sublineages Spoligotype and MIRU signatures of tensor sublineages of M. africanum 
strains. White indicates probability of 0 and black indicates probability of 1. Intermediate colors represent probabilities in the range (0,1). MAI 
and MA4 are similar in their MIRU signatures, and MA4 strains lack spacers 22 through 24, in addition to the deletions of MA1 strains. MIRU 
signatures of MA2 and MA3 strains are also similar, and MA2 has an extra deletion, 21 through 24, in addition to the deletions of MA3 strains. 



Table 4 Confusion matrix of distinct M. africanum strains 
based on MIRUVNTRplus sublineages 





MA1 


MA2 


MA3 


MA4 


West African 1 


0 


5 


0 


0 


West African 2 


21 


0 


0 


16 


Unspecified 


2 


13 


7 


0 



Confusion matrix for 64 distinct M. africanum strains showing the 
correspondence between the West African 1 and 2 sublineages and tensor 
sublineages. For the data not from MIRU-VNTRp/us, the lineage is indicated as 
unspecified. 



MANU1 family has not been well established by other 
biomarkers. Disregarding these two troubling families, 
the tensor sublineages correspond closely to the 
SpolDB4 families. Table 8 shows that there is almost a 
one-to-one mapping between most SpolDB4 lineages 
and Indo-Oceanic tensor sublineages. Specifically, the 
mapping between the most stable clusters (with subline- 
age stability) and the families are: IOl (.94) equals 
EAI6_BDG1, 102 (1) equals EAI3JND, 104 (1) equals 
ZERO, and 106 (.91) equals most of EAI2_MANILLA. 
All EAI strains are in 109 (.77), all EAI1 strains are in 
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Table 5 Confusion matrix of M. bovis strains 





MB1 


MB2 


MB3 


Stability 1 1 1 


BOV 


7 


5 


5 


BOVIS 1 


0 


0 


29 


B0VIS1_BCG 


0 


0 


1 1 


B0VIS2 


24 


0 


0 


B0VIS3 


0 


21 


0 



Confusion matrix of M. bovis strains clustered into 3 groups using PARAFAC. 
Correct labels are SpolDB4 labels on the rows, and tensor sublineages are 
represented by each column. Stability of 1 for the tensor sublineages 
indicates that they have clear and marked differences based on their 
genotype. MB1 contains all B0VIS2 strains, MB2 contains all B0VIS3 strains, 
and MB3 contains all B0VIS1 and B0VIS1_BCG strains. 



108 (.86), all MICROTI strains are in 105 (0.56), and all 
ZERO strains are in 104. All EAI2_NTB strains are in 
105, all EAI3JND strains are in 102, and all 
EAI8_MDG strains are in 107 (.84). EAI2_MANILLA is 
divided into two sublineages: 11 strains in 105, 265 
strains in 106. While the spoligotype and MIRU signa- 
tures show that there are distinct EAI5 subgroups, the 
definition of the EAI5 and MANU1 groups are not well 
supported by the tensor analysis. They may represent a 
more generic sublineage that is further subdivided. Dis- 
tinct patterns are observable in the spoligotype and 
MIRU signatures for most of the tensor sublineages. 

Sublineage structure of Euro-American Tucker3 
found the most stable clusters and it generates 35 subli- 
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Figure 5 The clustering plot of M. bovis strains Clustering plot of M. bovis strains using Principal Component Analysis. Three putative tensor 
sublineages, MB1 to MB3, are clearly separated. 
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Figure 6 Biomarker signatures of M. bovis tensor sublineages Spoligotype and MIRU signatures of tensor sublineages of M. bovis strains. 
Although MIRU signatures of MB1 and MB2 strains are similar, spoligotype signatures of MB1 and MB2 strains are clearly distinguishable by extra 
deletions of 13 through 14 in all MB2 strains, and deletions of 5 through 7 in some MB2 strains. 



neages for Euro-American, denoted El to E35, while 
there are 33 SpolDB4 lineages labeled Euro-American. 
See additional file 1 for the confusion matrix of Euro- 
American strains that shows the correspondence of ten- 
sor sublineages and SpolDB4 families. Figure 13 shows 
the clustering plot of tensor sublineages. Figure 14 and 
Figure 15 show the spoligotype and MIRU signatures of 
tensor sublineages respectively. 

Strains belonging to families H2, H37Rv, LAM12_- 
MAD1, Tl (Tuscany variant), T1_RUS2, T4, T5_MAD2, 
and T5_RUS1 are clustered in tensor sublineages E9, E7, 
E8, E24, Ell, E34, E34, and E17 respectively. In contrast, 
the Tl family, an ancestor strain family, is distributed 
across 25 tensor sublin-eages, with most Tl strains in 
E34. Sublineage stability is above .90 for 18 tensor subli- 
neages. Spoligotype and MIRU signatures of sublineages 
suggest either subdivision or merging of SpolDB4 
families. For instance, tensor sublineages E2, E6, and 
E32 include Tl strains only. In addition to common 
spacer deletions of Euro-American strains, E2 strains 



Table 6 Confusion matrix of East Asian (Beijing) strains 





B1 


B2 


B3 


B4 


B5 


B6 


Stability 111111 


BEIJING 


468 


0 


0 


18 


0 


41 


BEIJING-LIKE 


0 


16 


8 


0 


20 


0 



Confusion matrix of East Asian (Beijing) strains clustered into 6 groups using 
Tucker3. Correct labels are SpolDB4 labels on the rows, and tensor 
sublineages are represented by each column. The six highly stable tensor 
sublineages are indicative of additional genetic diversity within the BEIJING 
and BEIJING-LIKE sublineages. 



lack spacers 15 through 26, E6 strains lack spacers 9 
through 23, and E32 strains lack spacers 1 through 19, 
which are all variations in spoligotype signatures of Tl 
strains. This sublineage classification further subdivides 
the poorly-defined ancestor Tl family. Strains of LAM 
families on the other hand are grouped in 17 tensor 
sublineages. Prior studies have found that LAM Rio 
strains identified by SNPs are found in multiple 
SpolDB4 lineages [20]. Therefore, it is expected that the 
use of multiple biomarkers leads to subdivision or mer- 
ging of some SpolDB4 families. 

Although most stable clusters of the Euro-American 
strain dataset are found using best-match stability, the 
DD-weighted gap statistic plot has multiple peaks. DD- 
weighted gap statistic, detailed in the methods section, 
is a cluster validity measure which is also used for 
detecting hierarchical structure in the datasets. Multi- 
ple peaks in DD-weighted gap statistic plot suggest 
that the Euro-American dataset may have a multilevel 
hierarchical structure. Model order selection with ran- 
domized maps by Bertoni and Valentini can be used to 
detect the hierarchical structure in the Euro-American 
dataset [21]. 

We used the unsupervised tensor clustering frame- 
work to cluster MTBC strains of major lineages into 
sublineages. Next, we turn our attention to supervised 
tensor learning methods on multiple-biomarker tensors 
to classify strains into major lineages. 
Classification of MTBC strains into major lineages using 
two-way and multiway supervised learning 
Multiple-biomarker tensors can be used in supervised 
classification models as well as in unsupervised models. 
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kmeans_mtimes_seeded result on EastAsian lineage with k=6 
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Figure 7 The clustering plot of East Asian (Beijing) strains Clustering plot of East Asian (Beijing) strains using Principal Component Analysis. 
Six putative tensor sublineages, B1 to B6, are clearly distinct. 



We use multiway partial least squares (N-PLS) on mul- 
tiple-biomarker tensors to predict major MTBC 
lineages [22]. In our experiments, we used spoligotype 
and MIRU as biomarkers and predicted the six major 
lineages using the same data as for the above unsuper- 
vised learning experiments combined into a single 
dataset. More specifically, we used 12 spoligotype dele- 
tions found informative in major lineage classification 
combined with 12-loci MIRU [23]. We predicted major 
lineages with the N-PLS multiway method and com- 
pared it with standard two-way PLS and prior results 
for conformal Bayesian Networks [4]. Table 9 shows 
the average testing F-measure as estimated by 5-fold 
cross-validation. We generate the multiple-biomarker 
tensor using 12 spoligotype deletions and 12-loci 
MIRU with one additional bit indicating whether the 
at least one MIRU pattern includes letter rather than 



number of repeats, and create a predictive model using 
the N-PLS multiway method. The model for standard 
2-way PLS is created by representing the data as a 
matrix with columns corresponding to 12 spoligotype 
deletions and 12-loci MIRU with the additional indica- 
tor bit, and rows corresponding to MTBC strains. The 
number of latent variables for both N-PLS and PLS are 
selected by inner 4-fold cross-validation of the training 
set data only. 

We compare N-PLS, standard PLS and Conformal 
Bayes Network (CBN) methods by F-measure of major 
lineage classification and see that they are accurate pre- 
dictive models with no significant difference between 
the approaches. Table 9 shows the F-measure values for 
N-PLS, standard PLS and CBN. The average F-measure 
of major lineage prediction on the same data using the 
CBN is 0.9897 [4]. This shows that N-PLS and standard 
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Figure 8 Biomarker signatures of East Asian (Beijing) tensor sublineages Spoligotype and MIRU signatures of tensor sublineages of East 
Asian (Beijing) strains. Tensor sublineages B1, B4, B6 include BEIJING strains and sublineages B2, B3, B5 include BEIJING-LIKE strains. 



PLS methods predict major lineages as accurately as 
CBN, with a slightly better average F-measure value. All 
three methods achieve outstanding results for major 
lineage classification with no significant difference 
between approaches. 



Table 7 Confusion matrix of East-African Indian (CAS) 
strains 





C1 


C2 


C3 


C4 


Stability 1111 


CAS 


50 


21 


35 


1 


CAS1_DELHI 


331 


2 


0 


0 


CAS1_KILI 


0 


0 


0 


23 


CAS2 


45 


0 


0 


0 



Confusion matrix of East-African Indian (CAS) strains clustered into 4 groups 
using Tucker3. Correct labels are SpolDB4 labels on the rows, and tensor 
sublineages are represented by each column. 



Conclusions 

This study investigates multiple-biomarker tensors and 
illustrates how they can be used for both unsupervised 
and supervised learning models. First, a novel clustering 
framework is used to analyze the sublineage structure of 
MTBC strains based on multiple biomarkers. We gener- 
ated multiple-biomarker tensors to represent multiple 
biomarkers of the MTBC genome and used multiway 
models for dimensionality reduction. The multiway 
representation determines a transformation of the data 
that captures similarities and differences between strains 
based on two distinct biomarkers. We clustered MTBC 
strains based on the transformed data using improved 
k-means clustering and validated clustering results. We 
evaluated the sublineage structure of major lineages of 
MTBC and found similarities and clear distinctions in 
our subdivision of major lineages compared to 
the SpolDB4 classification. Simultaneous analysis of 
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Figure 9 The clustering plot of East-African Indian (CAS) strains Clustering plot of East-African Indian (CAS) strains using Principal 
Component Analysis. Four putative tensor sublineages, CI to C4, are dearly distinct. 



spoligotype and MIRU through multiple-biomarker ten- 
sors and clustering of MTBC strains leads to coherent 
sublineages within major lineages with clear and distinc- 
tive spoligotype and MIRU signatures. Second, we 
demonstrated how the multiple-biomarker tensor can be 
used to predict major lineages with extremely high accu- 
racy competitive with other approaches. We show that 
3-way PLS, 2-way PLS and CBN models are accurate 
major lineage predictors for MTBC strains. 

The tensor clustering framework is flexible and can be 
applied to any multidimensional strain data. The design 
of the resulting tensor depends on the question to be 
answered. In this study, multiple-biomarker tensors are 
designed to find groups of MTBC strains. Thus, the 
application of the tensor clustering framework on multi- 
ple-biomarker tensors leads to sublineages of MTBC 



within major lineages. The multiple-biomarker tensor is 
further validated by the fact that it can used to predict 
known major lineages with high accuracy using N-PLS. 
N-PLS with multiple-biomarker tensors can be used for 
semi-supervised learning as well. This can be useful for 
learning predictive models for sublineages in which only 
part of the data is labeled with sublineages and the 
other part of the data has no labels. This may result in 
more reliable and accurate classifiers of MTBC subli- 
neages, and the resulting sublineage classifiers would be 
a significant enhancement to TB control, epidemiology 
and research. We leave this to future work. 

The tensor clustering framework used in this study 
can be further extended to find subgroups of MTBC 
strains based on other biomarkers such as RFLP and 
SNPs. 15-loci MIRU and 24-loci MIRU patterns can 
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Figure 10 Biomarker signatures of East-African Indian (CAS) tensor sublineages Spoligotype and MIRU signatures of tensor sublineages of 


East-African Indian (CAS) strains. In addition to deletions in 


CI strains, C2 strains lack spacer 22. In addition to deletions in C3 strains, C4 strains 


lack spacer 35 and have 
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also be used to represent MTBC genomes with 
multiple-biomarker tensors. Moreover, more than two 
biomarkers can be used in the MTBC genome represen- 
tation. But, ambiguity in the tensor entries is an open 
question that needs to be solved in the tensor represen- 
tation when more than two biomarkers are used. Addi- 
tion of new biomarkers will increase the number of 
modes of the multiple-biomarker tensor, but the multi- 
way analysis methods will remain the same. 

Other questions of interest can be addressed by 
designing and analyzing host-pathogen tensors to 



examine the relationship of the pathogen genotype 
with host (or equivalent) attributes to examine 
questions of interest. For example, since the MTBC 
sublineages are known to be highly geographically 
dependent, a tensor which combines the pathogen gen- 
otype with the country of birth of the host may reveal 
additional sublineage structure and transmission pat- 
terns. A tensor combining MTBC genotype and host 
disease phenotype such as site of infection and drug 
resistance could be used to analyze MTBC genotype/ 
phenotype relations. 
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Table 8 Confusion matrix of Indo-Oceanic strains 
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Confusion matrix of Indo-Oceanic strains clustered into 9 groups using 
PARAFAC. Correct labels are SpolDB4 labels on the rows, and tensor 
sublineages are represented by each column. SpolDB4 lineages except EAI5 
and MANU1 map to distinct tensor sublineages. 



Methods 

Tensor Clustering Framework (TCF) 

Clustering MTBC strains based on multiple-biomarker 
tensors consists of a sequence of steps. First, we find 
informative feature set of spoligotype deletions and gen- 
erate a tensor. Second, we apply multiway models on 
the tensor and get a score matrix for the strain mode. 
Third, we use this score matrix to determine the simi- 
larity between strains, and cluster them using a stable 
version of k-means. In the final step, we evaluate the 
clustering results using cluster validity indices. This 
stepwise clustering framework is outlined in Figure 16. 
We describe the steps of the tensor clustering frame- 
work in this section. 
Datasets 

The dataset comprises 6848 distinct MTBC strains as 
determined by spoligotype and 12-loci MIRU, labeled 
with major lineages and SpolDB4 families. The strains 
are mainly from the CDC dataset - a database col- 
lected by the CDC from 2004-2008 labeled with the 
major lineages collected by the TB-Insight project 
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Figure 11 The clustering plot of Indo-Oceanic strains Clustering plot of Indo-Oceanic strains labeled by putative tensor sublineages using 
Principal Component Analysis. The tensor sublineages are not as distinct as they were for the previously analyzed major lineages, implying that 
the tensor sublineages are well distinguished in the PCA plot if they are stable. 
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Figure 13 The clustering plot of Euro-American strains Clustering plot of Euro-American strains labeled by 35 tensor sublineages using 
Principal Component Analysis. The tensor sublineages are not as distinct as they were for the previously analyzed major lineages, reflecting the 
variability in the tensor cluster stability. It may also be due to the anticipated hierarchical structure in Euro-American strains. 



(http://tbinsight.cs.rpi.edu/) that was previously studied 
in [4]. We also used the MIRU-VNTR/7/ws dataset 
from www.MIRUVNTRplus.org which is labeled with 
SpolDB4 lineages and sublineages. The original 
SpolDB4 labeled dataset provided in an online supple- 
ment [6] contains only spoligotypes. We found all 
occurrences of these spoligotypes in the CDC and 
MIRU-VNTRp/ws dataset and constructed a database 
with spoligotype and MIRU patterns, with major 
lineages as determined by CDC, and sublineages as 
given in the SpolDB4 database [6]. The numbers of 
strains for each major lineage in the resulting dataset 
are shown in Table 10. We created 6 datasets from the 
CDC+MIRU-VNTRp/ws dataset, one for each major 
lineage. These same 6 major lineage datasets were 



merged into one for the supervised learning 
experiment. 

Feature Selection and Tensor Generation 
Feature Selection The spoligotype pattern captures the 
variability in the DR locus of the MTBC genome. A 
spoligotype consists of 43 spacers represented as a 43- 
bit binary sequence, and according to the hidden par- 
ent assumption, one or more contiguous spacers can 
be lost in a deletion event, but rarely gained [8,24]. 
Therefore, there are possible deletions of lengths vary- 
ing from 1 to 43 in a spoligotype. Only subsets of spo- 
ligotypedeletions are required for effective 
discrimination of MTBC strains. A set of 12 deletion 
sequences of spoligotypes reported by Shabbeer et al. 
have proven to be good discriminator spacer deletions 
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Figure 14 Spoligotype signatures of Euro-American tensor 
sublineages Spoligotype signatures of tensor sublineages of Euro- 
American strains. 



for major lineage classification [23]. These 12 deletion 
sequences are used in the supervised learning study. 
Another set of 81 deletion sequences of spoligotypes 
reported by Brudey et al. have proven to be good dis- 
criminator spacer deletions for SpolDB4 sublineage 
classification [6]. 

Within the TCF, we built a feature selection algo- 
rithm to find spacer deletions that are informative. 



This insures that the results are not biased by a priori 
selection of spoligotype deletions. Given a set of spoli- 
gotypes, we first calculate the frequency/, i = X,„, 946, 
of each possible deletion among the spoligotypes of 
strains. If / = 1, the deletion is a common deletion. If 
0 < fi <threshold, the deletion is a nonexistent deletion, 
where threshold is data dependent and threshold = 
0.05 is used by default. The deletions with frequency 
fi such that threshold < f t i < 1 are uncommon deletions. 
In the second step, we iterate through the set of 
uncommon deletions U, and remove an uncommon 
deletion we U, if there exists a common deletion c e 
C which is a substring of u. We assign the final set of 
uncommon deletions as the feature set. Using the final 
feature set, we determine spoligotype deletions that are 
effective in discriminating the strains of the dataset. 
Algorithm 1 summarizes the feature selection proce- 
dure. Numbers of spoligotype deletions for each major 
lineage, found informative by the feature selection 
algorithm, are given in Table 10. 

Algorithm 1 FeatureSelection(Spoligotypes, th) 

1: // Classify all possible spoligotype deletions according to their frequency/, 

-0 < f t < th: Nonexistent deletions (N) 

-th < /,- < 1: Uncommon Deletions (U) 

-f i = 1: Common deletions (C) 
where th is the upper bound of frequency for nonexistent deletions. 
2: // Remove uncommon deletions which are a superset of a common deletion 
3: for each uncommon deletion u e U do 
4: if 3c g C which is a substring of u then 
5: Remove u from uncommon deletions: U = U \ {u} 
6: end if 
7: end for 

8: Return uncommon deletion set U as the final feature set. 

Tensor Generation We generated multiple-biomarker 
tensors using two biomarkers, spoligotype deletions and 
MIRU patterns, as explained earlier. The spoligotype 
deletions found informative by the feature selection 
algorithm are used in the generation of multiple- 
biomarker tensors. The multiple-biomarker tensor is of 
the form Strains x Spoligotype deletions x MIRU 
patterns. We used the tensor clustering framework on 
multiple-biomarker tensors to cluster strains. 
Multiway modeling 

Multiway models are needed to fit a model to multiway 
arrays. We used PARAFAC and Tucker3 techniques to 
model the tensors. We determined the number of com- 
ponents for each model to ensure a bound on the 
explained variance of data. 

Multiway models We used PARAFAC and Tucker3 
models to explain the tensor with high accuracy. Multi- 
way modeling of tensors was carried out using the 
n-wayToolbox of MATLAB by Bro et al. and the PLS 
toolbox[25,26]. 
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Table 9 Multiway N-PLS and standard two-way PLS 


Table 10 Data statistics by major lineage 




classification accuracy results 


Major lineage 


# Strains # Spoligotype deletions 


Method Average F-measure 


M. africanum 


64 


22 


N-PLS 0.9961 ± 0.0009 


M. bovis 


102 


34 


Standard PLS 0.9955 ± 0.001 7 


East Asian (Beijing) 


571 


5 


Conformal Bayes Net 0.9897 


East-African Indian(CAS) 


508 


18 


Multiway N-PLS and standard two-way PLS classification accuracy results 
when 12 spoligotype deletions and MIRU patterns are used to classify MTBC 
strains into major lineages. The excellent results compare favorably to prior 


Indo-Oceanic 


1023 


28 


Euro-American 


4580 


109 


results based on a conformal Bayesian Network in [4]. 


Numbers of strains in each major lineage of CDC+MIRU-VNTRp/us dataset and 




numbers of spoligotype delet 


ons identified by the feature selection 



PARAFAC 

PARAFAC is a generalization of singular value decom- 
position to multiway data [27,28]. A 3-way array X e 
R /x/xJC is moc j e i ec j by an 7?-component PARAFAC model 
as follows: 



X 



I 



r=l 



Feature Selection and 
Tensor Generation 



Multiway Modeling 

1. PARAFAC 

2. Tucker3 



> 




Clustering Algorithm 

kmeans mtimes seeded 







Cluster Validation 

Best-Match Stability 
DD-weighted Gap Statistic (PC) 



Figure 16 Tensor clustering framework Clustering framework of 
MTBC strains. High-dimensional genotype data are decomposed 
into two-dimensional arrays using multiway models, which are then 
used as input to the kmeans_mtimes_seeded algorithm. Clusterings 
are validated using best-match stability. In case of a tie, the DD- 
weighted gap statistic is used to pick the number of clusters. 



algorithm. 



where A e R'* K , B e R , C e R 1KXK are component 
matrices of first, second, and third mode. G e ji RxRxR j s 
the core array, and E e R Ix,xK is the residual term con- 
taining all unexplained variation. A description of the 
PARAFAC model is shown in Figure 17. 

The PARAFAC model is symmetric in all modes and 
the number of components in each mode is the same 
[29]. The PARAFAC model is a simple model, which 
comes with a restriction of the equality on the number 
of components in each mode which makes it difficult to 
fit a data array with the PARAFAC model. One advan- 
tage of the PARAFAC model is its uniqueness: fitting 
the PARAFAC model with the same number of compo- 
nents to a given multiway dataset returns the same 
result. 
Tucker3 

Tucker3 is an extension of bilinear factor analysis to 
multiway datasets [30]. A 3-way array X e R Ix ^ xK is 
modeled by a (.P,Q,i?)-component Tucker3 model as fol- 
lows: 



Q R 



X 



ijk 



t^pqr' AipBjqCfr + Ep 



p=l q=l r=l 



where A e R IxP , B e R 7 * Q , 
nent matrices of first, second and third modes respec 



, C e R*** are the compo- 



tively. G e R 



PxQxR 



the core array and E e R 



Ix/xK 



the residual term. A description of the Tucker3 model is 
shown in Figure 18. 

Tucker3 is a more flexible model compared to PAR- 
AFAC. This flexibility is due to the core array G, which 
allows interaction of any factor in a mode with any 
other factor in other modes [31]. Therefore, the number 
of components for each mode can be different. This 
results in indeterminacy of the Tucker3 model, since it 
cannot determine the component matrices uniquely. 

Model validation A multiway model is appropriate if 
adding more components to any mode does not 
improve the fit considerably. There is a tradeoff between 
the complexity of the model and the variance of the 
data explained by the model. Therefore, validation of a 
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Figure 17 PARAFAC model PARAFAC model of a three-way array X with R components. The tensor is modeled as a linear combination of 
rank-one tensors for each mode. 



model also determines a suitable complexity for the 
model. We used the core consistency diagnostic (COR- 
CONDIA) to determine the number of components of 
the PARAFAC model [32]. The core consistency diag- 
nostic measures the similarity of the core array G of the 
model and the superdiagonal array of ones. Core consis- 
tency is always less than or equal to 100% and may also 
be negative. As a rule of thumb, Bro et al. suggests that 
a core consistency above 90% implies a trilinear model 
[32]. In our experiments, we kept core consistency 
above 90%, while still explaining the variance of the data 
as much as possible with a trilinear model. We deter- 
mined the number of components of the Tucker3 
model by rank reduction on the unfolded tensor along 
each mode, and these components explain over 90% of 
the variance of the data. 
Clustering algorithm 

We developed the kmeans_mtimes_seeded algorithm, a 
modified version of the k-means algorithm, to group 
MTBC strains based on the score matrices of the multi- 



way models. K-means is a commonly used clustering 
algorithm with two weaknesses: 1) Initial centroids are 
chosen randomly, 2) The objective value of k-means, 
measured as within-cluster sum of squares, may con- 
verge to local minima, rather than finding the global 
minimum. We solve these problems with two improve- 
ments: 1) Initial centroids are chosen by careful seeding, 
using a heuristic called kmeans++, suggested by Arthur 
et al. [33]. Let D(x) represent the shortest Euclidean dis- 
tance from data point X to the closest center already 
chosen. kmeans++ chooses a new centroid at each step 
such that the new centroid is furthest from all chosen 
centroids. Algorithm 2 summarizes the kmeans++ pro- 
cedure. 2) The local minima problem is partially solved 
by repeating the k-means algorithm multiple times and 
retrieving the run with the minimum objective value. 
We repeated the algorithm m = 20 times. The 
kmeans_mtimes_seeded algorithm combines these two 
improvements, as summarized in algorithm 3. The 
kmeans_mtimes seeded algorithm is more stable 
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Figure 18 Tucker3 model Tucker3 model of a three-way array X with (P,Q,R) components at each mode. The tensor is decomposed into 
component matrices A, B, C, core array G, and residual array E. 
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compared to the k-means algorithm, and produces more 
accurate clusters. 



a cluster C, compared to a model clustering , is calcu- 
lated as: 



Algorithm 2 kmeans++(A,k) 
1: Pick the rst centroid c^ at random: InitCentroids = {c x } 
2: for i = 2 to k do 

3: Find D(a), distance to the closest centroid picked so far, for each data point ae A 
4: Pick the data point a with maximum D(a) as new centroid 
c i = argmaxD(fl) 

5: Add c ; to the set of initial centroids: InitCentroids = InitCentroids u {c^} 
6: end for 

7: Run kmeans(A,k) with InitCentroids 



Algorithm 3 kmeans_mtimes_seeded(A,k,m) 
1 : for i = 1 to m do 
2: kmeans++(A,k) 

3: Get the objective value of k-means run i 
4: end for 

5: Pick the k-means run with the minimum objective value 



Cluster Validation 

Clustering results for the MTBC strains are evaluated to 
determine the best choice for the number of clusters 
and compare the chosen clustering with existing subli- 
neages using cluster validity indices. We used the best- 
match stability to pick the most stable clusterings. In 
case of a tie in average best-match stability, we used the 
DD-weighted gap statistic for cluster validation [34]. We 
compare our clusters to an existing classification using 
the F-measure. 

Best-Match Stability The stability of a clustering is 
measured by the distribution of pairwise similarities 
between clusterings of subsamples of the data. The 
idea behind stability is that if we repeatedly sample 
data points and apply the same clustering algorithm to 
the subsample, then an effective clustering algorithm 
applied to well separated data should produce cluster- 
ings that do not vary much for different subsamples 
[35]. In such cases, the algorithm is stable independent 
of input randomization. We use best-match stability as 
suggested by Hopcroft et al. [36] to assess stability. 
The algorithm clusters the same data multiple times, 
and compares the reference cluster to model cluster- 
ings. We used 25 model clusterings to compare with 
the reference cluster. The stability of each cluster is 
calculated by finding the average best match between 
this cluster and the clusters identified using other 
model clusterings. High average best-match values 
denote that the two clusters have many strains in com- 
mon and are of roughly the same size [8]. We also cal- 
culate the average best-match of a clustering by 
finding the average of best-match values for all clusters 
in the reference clustering. Best-match stability of 



best _ match 

where 
match(C,C) ■ 



max match(C,ref CA 

i=\,..,k 



CnC'l 



max(|C|,|C'|) 



and refCi is the set of items in reference cluster i. 

DD-Weighted Gap Statistic (PC) Tibshirani et al. 
proposed a cluster validity index called the gap statistic, 
which is based on the within-cluster sum of squares 
(WCSS) of a clustering [37]. Let the dataset be X e 
R"*^ consisting of n data points with p dimensions. Let 
dij be the Euclidean distance between data points i and 
/. After clustering this dataset, suppose that we have 
k clusters Q, .., Q, where C, denotes the indices of data 
points in cluster i, of size «; = | Q | . The sum of within- 
cluster pairwise distances for cluster r is defined as: 



i.jeC,. 



and the within-cluster sum of squares for a clustering 
is defined as: 



k 

1=1 



The idea of the gap statistic method is to compare 
and its expected value under a reference distribution of 
the dataset. Therefore, the gap value is defined as: 

Gap n {k) = E* n {log(W k )}-\og{W k ) 

Where E n represents the expected value under a 
sample of size n based on a reference distribution. The 
optimal number of clusters is the value fe for which 
Gap n (k) is maximized. The selection of number of clus- 
ters via gap statistic is summarized in [37]. 

The reference distribution can be one of two choices: 
uniform distribution (Gap/Unif), or a uniform distribu- 
tion over a box aligned with the principal components 
of the dataset (Gap/PC). Experiments by Tibshirani 
et al. show that Gap/PC finds the number of clusters 
more accurately, therefore we used Gap/PC in this 
study [37]. 

The gap statistic is a powerful method for estimating 
the number of clusters in a dataset. However, a study by 
Dudoit et al. showed that the gap statistic does not 
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estimate the correct number of clusters for every case contingency table in Table 11, precision, recall, and 
[38]. This may be because increases as the number F-measure are defined as: 
of data points increases. Hierarchical structure of the 
data may also cause problems. The data may be com- fl 
posed of nested clusters and the gap statistic will be 
capturing only the minimum of these candidate num- 
bers of clusters. Yan et al. suggested a 2-step improve- 
ment to the gap statistic, called the DD-weighted gap a + b 
statistic [39]. They defined average within-cluster pair- p _ 
wise distances for cluster r as follows: P + R 



a + c 
a 



D = ^ 

U < 2n r (n r -l) 

and the weighted within-cluster sum of squares W k 
as: 

k k 

r=l r=i ry r ' 

Based on W k ,the weighted gap statistic Gap n (k) is 
defined as 

G^ n (fe) = £;{log(W fe )}-log(W fe ) 

Let DGap n {k) denote the difference in Gap n {k) 
w hen the number of clusters is raised from k-1 to k. 
DGap n {k) is defined as 

DGap n (k) = Gapjk) - Gap n {k - 1) 

DGap n (k) > 0 for fe < fe , and otherwise it will be 
close to zero. Therefore, to find a "knee" point in the 
plot, they i ntro duce a second difference equation and 
define D DGap n {k) as 

D DGap n {k) = DGap n {k) - DGapJk + 1) 

= 2Gap n {k) - Gap n (k - 1) - Gap n {k + 1) 

D DGap n (k) is maximized when k is equal to 
the true number of clusters. The advantage of 
D DGap n [k) over the gap statistic i s th at there may 
be multiple peaks in the plot of D DGap n (k) and this 
may indicate a hierarchical structure in the data. In 
such cases, multilayer analysis should be used instead 
of a single step procedure. 

F-measure The F-measure is a weighted combination 
of precision and recall of a clustering. Since the F-mea- 
sure combines precision and recall of clustering results, 
it has proven to be a successful metric. We use the F- 
measure to evaluate how similar the tensor sublineages 
are to the SpolDB4 families. According to the 



Multiway Partial Least Squares Regression (N-PLS) 

N-PLS is a multiway regression method where at least 
one of the independent and dependent blocks has at 
least three modes created by Bro et al. by generalizing 
PLS to multiway data [22]. Consider independent vari- 
ables in the X-block, X e R , and dependent vari- 
ables in the Y-block, Y e R .In our experiments, the 
X-block is a three-way array and the Y-block is a two- 
way array. The multiway array X is decomposed using a 
matricized version X e R IxJK as: 

X=t(w I< ® w 1 )' + E (1) 

and the two-way array Y is decomposed as: 

Y = uq' + F (2) 

where t e R /xl and u e R /xl are score vectors of X 
and Y. w' e R /xl and w K e R /<fxl are the loading vectors 
(weights) of the second and third modes of X respec- 
tively, q e R Mxl is the loading vector of Y. E e R /x/ff 
and F e R /xM are the residuals of X and Y respectively. 

Notice that the two-way array Y is decomposed into one 
score and one loading vector, whereas the matricized 
three-way array X is decomposed into one score and two 
loading vectors, w' and w K . This is the main difference 
between N-PLS and PLS. At each iteration of N-PLS, a 
new PLS component is added. If n PLS components are 
used, X is decomposed into component matrices T e 
A 1 *", W J e R /XM , W K e R KxH , and Y is decomposed into 
component matrices U e R /xn , Qe R Mx ". 

The aim of N-PLS is to maximize the covariance of X 
and Y. For this purpose, we define an inner relation 



Table 11 Contingency table 





Same cluster 


Different clusters 


Same class 


a 


b 


Different classes 


c 


d 



Contingency table of a clustering, where rows represent true classes and 
columns represent found clusters. Given n data points in the dataset, 



a + b + C + d = 

2 

V /■ 
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linking the X and Y blocks, using their score matrices, T 
and U: 
U = TB + E u (3) 

This requires finding loading vectors w' and w K such 
that the covariance of t and y are maximized: 



A = max 



/ 7 K 



1 j=1 k=\ 



cov(f, y) | min 
/ / k 

i=l j=l k=l 
1 K 



j=l k=l 



where Ze vJ xK is a matrix with and 
\\ w 1 1=| | u/ ^ ||=1. To maximize this expression, we 

write it in matrix notation: 

A = max((w')'Zw K )=>(w',w K ) = SV D(Z) 



and Y-block are done according to centering and scaling 
methods explained in [44]. 



Algorithm 4 Tri-PLS2(X e I 



1: X 

2 



(l) 

yo=Y(:,l) 
for i - 1 to N do 
repeat 

Calculate matrix Z e ] 



such that z 



6: Compute SVD of Z: Z = U I V 

7: Calculate loading vectors as first left and right singular vectors of Z: 

w 1 =U(:;1) w K = V(:; 1) 
8: Calculate score vector t 

t^o.o-x^tw^w 1 ) 

9: q=Y*T/[Y*T| 
10: y^Yq 
11: until y iA converges 
12: Calculate regression coefficient b: 

b = B(:,i) = (T*rr 1 T'y i _ 1 
13: Deflate X and Y 

X f =X M -t(w K g>w')' 

Y = Y - Tbq' 
14:end for 



The problem of finding w' and w K is simply solved by 
SVD on Z[22,40]. w J and w K are first left and right sin- 
gular vectors of Z. To reconstruct Y, we substitute (3) 
in equation 2: 



Y = (TB + EJQ'+F 

Y = TBQ + E U Q'+ F 

Y = TBQ '+ F * 



(4) 



Given X and its decomposition matrices, we can make 
predictions for a new X-block, using equation 4. The 
derivation of the full and closed predictions with N-PLS 
has been presented by Smilde et al. [41]. Three alterna- 
tive methods are proposed by De Jong et al. for deriva- 
tion of training models via regression coefficients [42]. 
Bro et al. proposed an improved N-PLS method with 
better fit of the independent data, keeping regression 
coefficients and predictions the same [43]. 

The N-PLS model of a multiway array is a multilinear 
model, like PARAFAC, which means that it has no rota- 
tional freedom. Therefore, the N-PLS model of a multi- 
way array is unique. In this study, we used a 3-way 
array as the X-block and a 2-way array as the Y-block, 
therefore we are particularly working on the Tri-PLS2 
version of N-PLS, which is summarized in Algorithm 4. 
The term X^) in the algorithm refers to X matricized 
along the first mode. The X-block and Y-block are cen- 
tered and scaled prior to application of the algorithm. 
The preprocessing and postprocessing of both X-block 



Additional material 



Additional file 1: Confusion matrix of Euro-American strains The 

confusion matrix of Euro-American strains that shows the 
correspondence of tensor sublineages and SpolDB4 families. Each row 
represents SpolDB4 families and each column represents tensor 
sublineages. 
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